Creating compelling captions for data visualizations has been a longstanding challenge. Visualization researchers are typically untrained in journalistic reporting, and hence the captions placed below data visualizations tend to be not overly engaging and instead stick to basic observations about the data. In this work we explore the opportunities offered by the newly emerging crop of large language models (LLMs), which use sophisticated deep learning technology to produce human-like prose. We ask: can these powerful software devices be purposed to produce engaging captions for generic data visualizations such as a scatterplot? It turns out that the key challenge lies in designing the most effective prompt for the LLM, a task called prompt engineering. We report on first experiments using the popular LLM GPT-3 and deliver some promising results.
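As a rough illustration of what prompt engineering for a scatterplot caption could involve, the sketch below summarizes the plotted data and embeds that summary in a text prompt. The abstract does not give the authors' prompts, so the template wording, the function names `pearson_r` and `build_caption_prompt`, and the choice of summary statistics are all assumptions made for illustration; no LLM is called.

```python
from math import sqrt
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(sxx * syy)


def build_caption_prompt(x_label: str, y_label: str,
                         xs: Sequence[float], ys: Sequence[float]) -> str:
    """Assemble a hypothetical caption-writing prompt from plot statistics."""
    r = pearson_r(xs, ys)
    return (
        f"A scatterplot shows {y_label} against {x_label} "
        f"for {len(xs)} data points.\n"
        f"{x_label} ranges from {min(xs)} to {max(xs)}; "
        f"{y_label} ranges from {min(ys)} to {max(ys)}.\n"
        f"The Pearson correlation is r = {r:.2f}.\n"
        "Write an engaging one-sentence caption for this chart."
    )


if __name__ == "__main__":
    print(build_caption_prompt("engine size", "fuel use",
                               [1, 2, 3, 4], [2, 4, 6, 8]))
```

The resulting string would be sent to a model of one's choice; varying which statistics are included and how the instruction is phrased is where the prompt design effort goes.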
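With the rise of AI, algorithms have become better at learning underlying patterns from training data, including ingrained social biases based on gender, race, and so on. The deployment of such algorithms in domains such as hiring, healthcare, and law enforcement has raised serious concerns about fairness, accountability, trust, and interpretability in machine learning algorithms. To alleviate this problem, we propose D-Bias, a visual interactive tool that embodies a human-in-the-loop AI approach for auditing and mitigating social biases in tabular datasets. It uses a graphical causal model to represent causal relationships among different features in the dataset and as a medium to inject domain knowledge. A user can detect the presence of bias against a group (for example, women) or a subgroup by identifying unfair causal relationships in the causal network and using an array of fairness metrics. Thereafter, the user can mitigate bias by acting on the unfair causal edges. For each interaction, such as weakening or deleting a biased causal edge, the system uses a novel method to simulate a new (debiased) dataset based on the current causal model. Users can visually assess the impact of their interactions on different fairness metrics, utility metrics, data distortion, and the underlying data distribution. Once satisfied, they can download the debiased dataset and use it for any downstream application to obtain fairer predictions. We evaluate D-Bias by conducting experiments on 3 datasets as well as a formal user study. We find that, compared with a baseline debiasing approach across different fairness metrics, D-Bias helps reduce bias significantly while incurring little data distortion and a small loss in utility. Moreover, our human-in-the-loop approach significantly outperforms an automated approach on trust, interpretability, and accountability.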
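Understanding the cumulative effect of multiple fairness-enhancing interventions at different stages of the machine learning (ML) pipeline is a critical and underexplored aspect of the fairness literature. Such knowledge can be valuable to data scientists and ML practitioners in designing fair ML pipelines. This paper takes a first step in exploring this area by conducting an extensive empirical study comprising 60 interventions, 9 fairness metrics, and 2 utility metrics (accuracy and F1 score) across 4 benchmark datasets. We quantitatively analyze the experimental data to measure the impact of multiple interventions on fairness, utility, and population groups. We find that applying multiple interventions results in better fairness and lower utility than individual interventions. However, adding more interventions does not always result in better fairness or worse utility. The likelihood of achieving high performance (F1 score) together with high fairness increases with a larger number of interventions. On the downside, we find that fairness-enhancing interventions can negatively affect different population groups, especially the privileged group. This study highlights the need for new fairness metrics that account for the impact on different population groups in addition to the disparity between groups. Finally, we offer a list of combinations of interventions that perform best for different fairness and utility metrics, to aid the design of fair ML pipelines.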
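Despite recent advances in semantic manipulation using StyleGAN, semantic editing of real faces remains challenging. The gap between the $W$ space and the $W$+ space forces an undesirable trade-off between reconstruction quality and editing quality. To solve this problem, we propose to expand the latent space by replacing the fully connected layers in StyleGAN's mapping network with attention-based transformers. This simple and effective technique integrates the two aforementioned spaces and transforms them into a single new latent space called $W$++. Our modified StyleGAN maintains the state-of-the-art generation quality of the original StyleGAN with moderately better diversity. More importantly, the proposed $W$++ space achieves superior performance in both reconstruction quality and editing quality. Despite these significant advantages, our $W$++ space supports existing inversion algorithms and editing methods with only negligible modifications, thanks to its structural similarity to the $W/W$+ space. Extensive experiments on the FFHQ dataset show that our proposed $W$++ space is clearly preferable to the previous $W/W$+ space. The code is publicly available at https://github.com/anonsubm2021/transstylegan.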
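Recent advances in the field of deep learning have demonstrated the effectiveness of very large neural networks in several applications. However, as these deep neural networks continue to grow in size, it becomes increasingly difficult to configure their many parameters to obtain good results. At present, analysts must experiment with many different configurations and parameter settings, which is labor-intensive and time-consuming. On the other hand, the capacity of fully automated techniques for neural network architecture search is limited without the domain knowledge of human experts. To address this problem, we formulate the task of neural network architecture optimization as a graph space exploration, based on the one-shot architecture search technique. In this approach, a super-graph of all candidate architectures is trained in one shot, and the optimal neural network is identified as a sub-graph. In this paper, we present a framework that allows analysts to effectively build the solution sub-graph space and to guide the network search by injecting their domain knowledge. Starting from a network architecture space composed of basic neural network components, analysts are empowered to effectively select the most promising components via our one-shot search scheme. Applying this technique iteratively allows analysts to converge on the best-performing neural network architecture for a given application. During the exploration, analysts can use their domain knowledge, aided by cues provided by a scatterplot visualization of the search space, to edit different components and guide the search toward faster convergence. We designed the interface in collaboration with several deep learning researchers and evaluated its final effectiveness through a user study and two case studies.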
As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to write yes/no questions to making complex Winogender schemas with multiple stages of LM-based generation and filtering. Crowdworkers rate the examples as highly relevant and agree with 90-100% of labels, sometimes more so than corresponding human-written datasets. We generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size. Larger LMs repeat back a dialog user's preferred answer ("sycophancy") and express greater desire to pursue concerning goals like resource acquisition and goal preservation. We also find some of the first examples of inverse scaling in RL from Human Feedback (RLHF), where more RLHF makes LMs worse. For example, RLHF makes LMs express stronger political views (on gun rights and immigration) and a greater desire to avoid shut down. Overall, LM-written evaluations are high-quality and let us quickly discover many novel LM behaviors.
Targeted syntactic evaluations of language models ask whether models show stable preferences for syntactically acceptable content over minimal-pair unacceptable inputs. Most targeted syntactic evaluation datasets ask models to make these judgements with just a single context-free sentence as input. This does not match language models' training regime, in which input sentences are always highly contextualized by the surrounding corpus. This mismatch raises an important question: how robust are models' syntactic judgements in different contexts? In this paper, we investigate the stability of language models' performance on targeted syntactic evaluations as we vary properties of the input context: the length of the context, the types of syntactic phenomena it contains, and whether or not there are violations of grammaticality. We find that model judgements are generally robust when placed in randomly sampled linguistic contexts. However, they are substantially unstable for contexts containing syntactic structures matching those in the critical test content. Among all tested models (GPT-2 and five variants of OPT), we significantly improve models' judgements by providing contexts with matching syntactic structures, and conversely significantly worsen them using unacceptable contexts with matching but violated syntactic structures. This effect is amplified by the length of the context, except for unrelated inputs. We show that these changes in model performance are not explainable by simple features matching the context and the test inputs, such as lexical overlap and dependency overlap. This sensitivity to highly specific syntactic features of the context can only be explained by the models' implicit in-context learning abilities.
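The minimal-pair protocol described above can be sketched as follows: a model is counted as correct on a pair when it assigns a higher log-probability to the acceptable sentence than to the unacceptable one, optionally with a shared context prepended to both. This is a minimal sketch, not the paper's code; `minimal_pair_accuracy` and `toy_log_prob` are hypothetical names, and the toy scorer is a stand-in for a real language model's sentence log-probability.

```python
from typing import Callable, Iterable, Tuple


def minimal_pair_accuracy(
    pairs: Iterable[Tuple[str, str]],
    log_prob: Callable[[str], float],
    context: str = "",
) -> float:
    """Fraction of (acceptable, unacceptable) pairs ranked correctly.

    The same context string is prepended to both members of a pair,
    so only the critical sentence differs between the two inputs.
    Ties count as incorrect.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one minimal pair")
    correct = sum(
        1 for good, bad in pairs
        if log_prob(context + good) > log_prob(context + bad)
    )
    return correct / len(pairs)


# Hypothetical stand-in scorer: penalizes two hard-coded agreement errors.
# A real evaluation would sum token log-probabilities from a language model.
BAD_PATTERNS = ("keys is", "dog bark.")


def toy_log_prob(text: str) -> float:
    penalty = sum(text.count(p) for p in BAD_PATTERNS)
    return -float(len(text.split())) - 5.0 * penalty


PAIRS = [
    ("The keys are here.", "The keys is here."),
    ("The dog barks.", "The dog bark."),
]

if __name__ == "__main__":
    print(minimal_pair_accuracy(PAIRS, toy_log_prob))
    print(minimal_pair_accuracy(PAIRS, toy_log_prob,
                                context="The cats sleep. "))
```

Varying the `context` argument (its length, its syntactic structure, and whether it is itself grammatical) while holding the pairs fixed mirrors the kind of stability question the abstract poses.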
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical imaging analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% of the challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants (32%) stated that they did not have enough time for method development. 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants, and only 50% of the participants performed ensembling, based on either multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.
Traditional multi-task learning architectures train a single model across multiple tasks through a shared encoder followed by task-specific decoders. Learning these models often requires specialized training algorithms that address task-conflict in the shared parameter updates, which otherwise can lead to negative transfer. A new type of multi-task learning within NLP homogenizes multi-task architectures as a shared encoder and language model decoder, which does surprisingly well across a range of diverse tasks. Does this new architecture suffer from task-conflicts that require specialized training algorithms? We study how certain factors in the shift towards text-to-text models affect multi-task conflict and negative transfer, finding that both directional conflict and transfer are surprisingly constant across architectures.